EVALUATING WRITING:
LINKING LARGE-SCALE TESTING 
AND CLASSROOM ASSESSMENT 

Sarah Warshauer Freedman 
University of California at Berkeley 

Robert Hogan, then executive director of the National Council of Teachers of 
English, opens his preface to Paul Diederich's 1974 book Measuring Growth in English 
with the following words: 

Somehow the teaching of English has been wrenched out of the Age 
of Aquarius and thrust into the Age of Accountability. Many of us view
educational accountants in much the same spirit as we view the agent of the 
Internal Revenue Service coming to audit our returns. Theoretically, it is 
possible the agent will turn out to be a pleasant person, gregarious and 
affable, who writes poetry in his free time and who will help us by showing 
how we failed to claim all our allowable deductions, so that the result of the 
audit is the discovery of a new friend and a substantial refund. But 
somehow we doubt that possibility. 

For the specialist in measurement and testing we have our image, 
too. In his graduate work, one of the foreign languages he studied was 
statistics. And he passed it. The other one was that amazing and arcane 
language the testing specialists use when they talk to one another. He 
passed it, too, and is fluent in it. He doesn't think of children except as 
they distribute themselves across deciles. He attempts with his chi-squares
to measure what we've done without ever understanding what we were 
trying to do. (p. iii) 

Most English teachers, I suspect, would still agree with Hogan's remarks. I will focus in 
this paper on bridging this rather wide gap between teachers of writing and the testing and
measurement community. I will focus on two currently distinct kinds of writing
evaluation: large-scale testing at the national, state, district, and sometimes school levels,
the natural domain of the educational accountants, and classroom assessment by teachers
looking at their own students inside their own classrooms, teachers who see kids and not
distributions of deciles but whose judgments, according to measurement specialists, may
be unreliable and biased.^1 In writing, as in most areas of the curriculum, large-scale testing
and classroom assessment normally serve different purposes and quite appropriately 
assume different forms. However, if we could create a tight fit between large-scale testing 
and classroom assessment, we could potentially add to the kinds of information we now
get from large-scale testing programs, and we could help teachers strengthen their 
classroom assessments and thereby their teaching and their students' learning. 



^1 In this paper the term testing will refer to large-scale standardized evaluation and assessment will refer to
the evaluative judgments of the classroom teacher. Calfee (1987) describes testing activities as usually
"group administered, multiple choice, mandated by external authorities, used by the public and policy
makers to decide 'how the schools are doing'" while assessment activities include "evaluation of individual
student performance, based on the teacher's decisions about curriculum and instruction at the classroom
level, aimed toward the student's grasp of concepts and mastery of transferrable skills (Calfee and Drum,
1979)" (p. 738).
Before presenting some ideas for linking large-scale testing and classroom
assessment, I will provide background about the form of most large-scale writing tests and 
will discuss their limitations. I will then describe portfolio assessment, an important 
innovation in classroom writing evaluation that is filtering up in some cases to the state
level and now even to the National Assessment of Educational Progress (NAEP). Portfolio
assessment contains the foundations for potential formal links between large-scale testing
and classroom assessment levels. Finally, I will give several examples of portfolio
programs at work, examples that I find helpful as I think about possible future directions
for writing assessment and instruction in this country: a large-scale, classroom-centered
portfolio effort for elementary students in England, The Primary Language Record; a state-
level portfolio assessment from Vermont for grades 4 and 11; and a large-scale national
examination at the secondary level in Great Britain, the General Certificate of Secondary
Education (GCSE). 

Large-scale Testing 

Historically, the large-scale testing of writing has developed to fulfill a number of 
purposes: (a) to certify that students have mastered writing at some level (e.g., the
National Assessment of Educational Progress); (b) to evaluate writing programs in the
school, district, or in some cases classroom (e.g., the California Assessment Program); (c)
to place students in programs or classes (e.g., many college-level placement examinations
given to freshmen); (d) to decide the fate of individuals with respect to admissions,
promotion, or graduation ("gatekeeping") (e.g., the SAT, high school graduation tests,
writing samples gathered by potential employers). Unlike classroom assessment, large- 
scale testing generally has not been concerned with charting the development of individual 
writers. 

Across the years, large-scale testing programs have struggled with a difficult 
problem: how to evaluate student writing reliably and cost-effectively. One highly 
criticized but commonly used way is through indirect measures designed to provide proxies 
for writing abilities. Indirect measures are generally multiple-choice tests and typically
include questions about grammar or sentence structure or scrambled paragraphs to be
rearranged in a logical order. These indirect measures are in widespread use; in 1984, 19
states measured writing indirectly while only 13 had direct measures, and 18 had no
measures at all (Burstein et al., 1985, in Baker, 1989). The appeal of indirect measures of
writing is obvious; they're quick to administer and cheap to score. The problems are
obvious too; indirect measures are poor predictors of how well the test-taker actually 
writes. According to Gertrude Conlan (1986), long-time specialist in writing assessment at 
Educational Testing Service: 

No multiple-choice question can be used to discover how well students can 
express their own ideas in their own words, how well they can marshal 
evidence to support their arguments, or how well they can adjust to the need 
to communicate for a particular purpose and to a particular audience. Nor 
can multiple-choice questions ever indicate whether what the student writes 
will be interesting to read. (p. 124) 

And if we believe Resnick and Resnick (1990) that "[y]ou get what you assess," multiple-
choice writing tests will have negative effects on instruction since teaching to the test would 
not include asking students to write. 

From 1890 on into the 1960s the College Entrance Examination Board (CEEB) 
struggled to find practical ways to move away from multiple-choice, indirect measures of 
writing. The goal was to design direct assessments that would include the collection and
scoring of actual samples of student writing (Diederich, French, & Carlton, 1961;
Godshalk, Swineford, & Coffman, 1966; Huddleston, 1954; Meyers, McConville, & 
Coffman, 1966). CEEB's struggles were many. First of all, the student writing would 
have to be evaluated. Besides the expense of paying humans to score actual writing 
samples, it proved difficult to get them to agree with one another on even a single general- 
impression score. In 1961 Diederich, French, and Carlton at the Educational Testing
Service (ETS) conducted a study in which "sixty distinguished readers in six occupational 
fields" read 300 papers written by college freshmen (in Diederich, 1974, p. 5). Of the 300 
papers, "101 received every grade from 1 to 9" (p. 6). On as many papers as they could, 
the readers wrote brief comments about what they liked and disliked. These comments 
helped ETS researchers understand why readers disagreed. 

During the 1960s ETS and the CEEB developed ways of training readers to agree 
independently on "holistic" or general impression scores for student writing, thus solving 
the reliability problems of direct assessment (Cooper, 1977; Diederich, 1974). For this 
scoring, readers are trained to evaluate each piece of student writing relative to the other 
pieces in the set, without consideration of standards external to the examination itself 
(Charney, 1984). Besides figuring out how to score the writing reliably, the testing
agencies also figured out ways to collect writing samples in a controlled setting, on
assigned topics, and under timed conditions. With the practical problems solved and
routines for testing and scoring in place, the door opened to the current, widespread, large- 
scale, direct assessments of writing (Davis, Scriven, & Thomas, 1987; Diederich, 1974; 
Faigley et al., 1985; Myers, 1980; White, 1985). 

When direct writing assessments were relatively novel, the profession breathed a 
sigh of relief that writing could be tested by having students write. Diederich's opening to
his 1974 book typified the opinions of the day: 

As a test of writing ability, no test is as convincing to teachers of 
English, to teachers in other departments, to prospective employers, and to
the public as actual samples of each student's writing, especially if the
writing is done under test conditions in which one can be sure that each
sample is the student's own unaided work. (p. 1)

However, Diederich's words sound dated now. With large-scale direct assessments of 
writing in widespread use, educators are already raising questions about their validity, just
as they did and continue to do for the indirect measures provided by multiple-choice tests.
Many tensions center around the nature of test-writing itself. Although controlled and
written under unaided conditions, as Diederich points out, such writing has little function
for students other than for them to be evaluated. Too, students must write on topics they
have not selected and may not be interested in. Further, they are not given sufficient time
to engage in the elaborated processes that are fundamental to how good writers write and to
how writing ideally is taught (Brown, 1986; Lucas, 1988a,b; Simmons, 1990; Witte et al.,
in press). In short, the writing conditions are "unnatural." Finally, educators often make
claims about writing in general and students' writing abilities based on one or perhaps a 
few kinds of writing, written in one kind of context, the testing setting. 

Current debates surrounding the NAEP writing assessment provide important 
illustrations of the tensions surrounding most large-scale, direct writing assessments. The
goal of the NAEP assessment is to provide at five-year intervals "an overall portrait of the
writing achievement of American students in grades 4, 8, and 11" (1990b, p. 9) as well as
to mark changing "trends in writing achievement" across the years (1986a, p. 6). The
National Assessment gathers informative, persuasive, and imaginative writing samples
from students at the three grade levels. For eighth- and twelfth-graders, the test "is divided
into blocks of approximately 15 minutes each, and each student is administered a booklet
containing three blocks as well as a six-minute block of background questions common to 
all students" (1986a, p. 92). During a 15-minute block, students write on either one or two 
topics. For fourth-graders, the blocks last only 10 minutes (National Assessment of 
Educational Progress, 1990a). This means that fourth-graders have had between 5 and 10 
minutes to produce up to four pieces of writing during a 30-minute test; eighth- and 
twelfth-graders have had between 7 1/2 and 15 minutes to produce up to four pieces during
a 45-minute test (National Assessment of Educational Progress, 1990a).

For good reason, writing researchers and educators have critiqued the National 
Assessment, arguing that it is not valid to make claims about the writing achievement of our 
nation's schoolchildren given the NAEP testing conditions, especially the short time 
students have for writing, and given the way the writing is evaluated (e.g., see Mellon, 
1975; Nold, 1981; Silberman, 1989). With respect to the testing conditions, the NAEP
report writers themselves caution: 

The samples of writing generated by students in the assessments represent 
their ability to produce first-draft writing on demand in a relatively short 
time under less than ideal conditions; thus, the guidelines for evaluating task 
accomplishment are designed to reflect these constraints and do not require a 
finished performance. (1990b, p. 7) 

Based on NAEP writing data, how confident can we be in the following claim made in The 
Writing Report Card: "A major conclusion to draw from this assessment is that students at 
all grade levels are deficient in higher-order thinking skills" (1986a, p. 11)? Can students 
possibly reveal their higher-order thinking skills in 15 minutes when writing on an 
assigned topic that they have never seen? 

In stark contrast to most testing conditions and consistent with our sense of how 
writing can be used to support the development of sophisticated higher order thinking, the 
pedagogical and research literature in writing from the past decade shows that higher-order 
thinking occurs when there is an increased focus on a writing process which includes
encouraging students to take lots of time with their writing, to think deeply and write about
issues in which they feel some investment, and to make use of plentiful response from both
peers and teachers as they revise (Dyson & Freedman, in press; Freedman, 1987). Most
tightly timed test-type writing goes against current pedagogical trends. What Mellon 
(1975) pointed out about the NAEP writing assessment some 15 years ago remains true 
today: 

One problem with the NAEP essay exercises, which is also a 
problem in classroom teaching, is that the assessors seem to have
underestimated the arduousness of writing as an activity and consequently
overestimated the level of investment that unrewarded and unmotivated
students would bring to the task. After all, the students were asked to write
by examiners whom they did not know. They were told that their teachers
would not see their writing, that it would not influence their marks or
academic futures, and presumably that they would receive no feedback at all 
on their efforts. 

Clearly this arrangement was meant to allay the students' fears, but
its effect must have been to demotivate them to some degree, though how
much is anyone's guess. We all know that it is difficult enough to devote a
half hour's worth of interest and sustained effort to writing externally
imposed topics carrying the promise of teacher approbation and academic
marks. But to do so as a flat favor to a stranger would seem to require more
generosity and dutiful compliance than many young people can summon up. 

. . . Answering multiple choice questions without a reward in a 
mathematics assessment or a science lesson may be one thing. Giving of 
the self what one must give to produce an effective prose discourse, 
especially if it is required solely for purposes of measurement and 
evaluation, is quite another. (p. 34)

NAEP is attempting to respond to these criticisms about the time for the testing. In 
1988 NAEP gave a subsample of the students twice as much time on one informative,
persuasive, and imaginative topic at each grade level (20 minutes for grade 4 and 30
minutes for grades 8 and 12) (National Assessment of Educational Progress, 1990a). The
results show that with increased time all students scored significantly better on the narrative
tasks and fourth- and twelfth-graders scored significantly better on the persuasive tasks;
only the informative tasks showed no differences. Most disturbing, the extra time proved
more helpful to White students than to Blacks or Hispanics, widening the gaps between 
these groups in the assessment results. 

For the 1992 assessment, NAEP plans to provide more time across the board:

As a result of both the findings from this study and the desire to be
responsive to the latest developments in writing instruction and assessment,
the response time will be increased for all writing tasks administered in the
1992 NAEP assessment. At grade 4, students will be given 25 minutes to
perform each task, and at grades 8 and 12, students will be given either 25
or 50 minutes. These tasks will be designed to encourage students to
allocate their time across various writing activities from gathering,
analyzing, and organizing their thoughts to communicating them in writing.
(1990a, p. 87) 

Providing 25 or even 50 minutes for writing on a given topic will probably prove 
insufficient to quiet NAEP critics, since even that amount of time will not resolve the basic
discrepancy between what we argue should be happening in classrooms and what happens
in this testing setting. Furthermore, the findings about Blacks and Hispanics raise a new
set of questions about equity and testing, not to mention equity in classroom opportunities
to learn. Besides the double time, NAEP is also collecting portfolios of student writing
produced as a natural part of writing instruction. The assessors have not yet decided how
to evaluate the portfolios, but these data promise to provide important supplementary
information for the Assessment. It will be important to remember that as the Assessment
changes, the only way to collect data about trends across time will be to keep some parallel
tasks. Thus, 15-minute samples will still be used for the trend studies and conclusions
about trends will be based on these very short samples. 

Another major point of tension in the National Assessment centers around the issue
of scoring. In an effort to obtain more information than a single holistic score and to define
clearly the features of writing being judged, in the mid 1970s NAEP developed an
additional scoring system, "the Primary Trait Scoring method" (Lloyd-Jones, 1977, p. 33).
While the criteria for judging writing holistically emerge from the writing the students do,
the goal of primary trait scoring is to set specific criteria for successful writing on a
particular topic ahead of time. The primary trait is determined and defined by the test-
maker who decides what will be essential to writing successfully on each topic on the test.
Traits vary depending on the topics. Tensions arise because the test-makers cannot always
anticipate precisely what test-takers will do to produce good writing on a particular topic,
and what is primary or whether one aspect of writing should be labeled primary is a
debatable point.

The dilemmas come across clearly through an analysis of Lloyd-Jones's (1977)
example of a primary trait scoring rubric. Lloyd-Jones explains that in one NAEP prompt
children were to write about the following: "Some people believe that a woman's place is
in the home. Others do not. Take ONE side of this issue. Write an essay in which you 
state your position and defend it" (p. 60). The directions for scoring this trait show the 
conflicts that are likely to emerge between a primary trait and a holistic score representing 
the general quality of the student's writing. The writing receives a 0 score if the writer 
gives no response or a fragmented response; it receives a 1 score if the writer does not take
a clear position, takes a position but gives no reason, restates the stem, gives and then
abandons a position, presents a confused or undefined position, or gives a position without
reasons; it receives a 2 if the writer takes a position and gives one unelaborated reason; it
receives a 3 if the writer takes a position and gives one elaborated reason, one elaborated
reason plus one unelaborated reason, or two or three unelaborated reasons; it receives a 4 if 
the writer takes a position and gives two or more elaborated reasons, one elaborated reason 
plus two or more unelaborated reasons, or four or more unelaborated reasons. 

What happens to the student who does not follow directions to take "ONE" position 
on a woman's place but points out the complexity of the issue rather than taking a side, 
perhaps showing how a woman has many places, in the home and out? This student 
would receive a 1 score but might write a substantially better essay than a student who 
receives a 2, 3, or 4 score for taking a side and providing one or more reasons. In another 
scenario a student who gives one elaborated reason for a 3 score could write a far better 
essay than the student who gives four or more unelaborated reasons and receives a 4. 
NAEP scoring rubrics seem to have gotten less specific and therefore less controversial 
over the years. 
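
The rubric described above can be read as a small decision procedure. The following is only a minimal sketch of that reading, not NAEP's actual scoring instrument: the function name, the parameter names, and the reduction of a rater's judgment to two counts of reasons are all assumptions made for illustration. Writing the rules out this way makes it easier to see why the two scenarios just described produce scores that run against general writing quality.

```python
def primary_trait_score(responded, takes_clear_position, elaborated, unelaborated):
    """Return a 0-4 score following the rubric as paraphrased in the text.

    All parameter names are hypothetical. `elaborated` and `unelaborated`
    are the numbers of reasons of each kind that a rater has identified.
    """
    if not responded:
        return 0  # no response or a fragmented response
    if not takes_clear_position or (elaborated + unelaborated) == 0:
        return 1  # no clear position, or a position without reasons
    if elaborated >= 2:
        return 4  # two or more elaborated reasons
    if elaborated == 1:
        # one elaborated reason plus two or more unelaborated reasons earns a 4
        return 4 if unelaborated >= 2 else 3
    # From here on there are no elaborated reasons and at least one unelaborated one.
    if unelaborated >= 4:
        return 4
    if unelaborated >= 2:
        return 3
    return 2


# A writer who explores the complexity of the issue without taking ONE side:
nuanced_essay = primary_trait_score(True, False, 3, 0)
# A writer who takes a side and develops a single reason fully:
one_developed_reason = primary_trait_score(True, True, 1, 0)
# A writer who takes a side and lists four undeveloped reasons:
four_thin_reasons = primary_trait_score(True, True, 0, 4)
```

Under this reading the essay that weighs both sides receives a 1 regardless of how well it is argued, and the list of four undeveloped reasons outscores the single fully developed one.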

Besides these issues of judging elaboration particular to this scoring rubric, the 
primary trait score only measures one aspect of writing. By contrast, a holistic score takes 
into account the whole piece — including its fluency, sentence structure, organization, 
coherence, mechanics, and idea development. Indeed, in a study comparing holistic and 
primary trait scoring, NAEP found that primary trait scoring does not correlate particularly
well with holistic quality judgments; correlations ranged from .38 to .66 depending on the
topic (1986a, p. 84). Freedman (1979) found that holistic scores are based primarily on
how well writers develop their ideas and then organize them, but once writers do a good
job at development and organization, then the rater counts syntax and mechanics. 

Whereas NAEP uses a holistic score, a primary trait score, and a mechanics score 
for its trends reports (1986b, 1990b), NAEP uses only primary trait scoring for the reports 
on the status of writing for a given year (1986a, 1990a). In the latest status report, NAEP 
(1990a) explains, "The responses were not evaluated for fluency or for grammar, 
punctuation, and spelling, but information on these aspects of writing performance is 
contained in the writing trend report" (p. 60). 

At the state level the issues in large-scale, direct writing assessment are similar to
those illustrated by the debates surrounding NAEP. States with direct writing assessments
are facing the same challenges as NAEP, and several states are meeting the challenges in
interesting ways. For example, let's look at the case of Alaska (Calkins, personal
correspondence). Two years ago in an effort to increase accountability the Alaska state
school board mandated the Iowa Test of Basic Skills for grades four, six, and eight. The
Iowa test, developed in 1929, contains multiple-choice items in grammar and sentence
structure, but the introduction to the test explicitly says that it is not designed to test writing
skills. Alaska teachers of writing are well organized through the Alaska Writing
Consortium, an affiliate of the National Writing Project, and have strong leadership in the
State Department of Education. Open to the accountability concerns of the State Board and 
anxious to learn about the fruits of their classroom efforts, Consortium members proposed 
a direct writing assessment that would yield information about students' writing 
achievement beyond whatever other information the Iowa test might provide. The state 
funded an experiment at the tenth-grade level, and in 1989-1990 twelve districts 
participated voluntarily. The writing was scored with an analytic scale, the third method
besides primary trait and holistic scoring that is commonly used in large-scale, direct
writing assessments. The analytic scale offers more information than a single holistic score
but avoids some of the problems associated with primary trait scoring.^2 The analytic scale
differs from primary trait because the categories are generic to good writing and are thus
independent of a given topic. On this scale raters give separate scores on ideas,
organization, wording, flavor, usage and sentence structure, punctuation and other
mechanics, spelling, and handwriting (Diederich, 1974). An analytic scale is used by the
International Association for the Evaluation of Educational Achievement (IEA) studies of
written language (Gorman et al., 1988; Gubb et al., 1987). 

For the Alaska test, teachers also wanted to maintain some control over the testing
conditions while allowing students more natural and comfortable writing conditions than is
usual for large-scale, formal assessments. Thus, students were given a common prompt
but were allowed two 50-minute time blocks on separate days to complete the writing. For
the Alaska experiment 60 papers from each of the districts were scored, enough writing to
provide a substantial amount of information about student writing beyond what the state
board could get from the Iowa test that they were using. In particular the direct testing
showed that knowledge of sentence structure does not guarantee good ideas. The board
also learned that direct assessments were easy to administer and cost-effective. This past
year 22 districts out of Alaska's 54 districts volunteered to participate, and Alaska teachers
are experimenting with other assessment alternatives as well. To these alternatives,
emerging mostly from the classroom up, I will now turn. 

New Directions: Writing Portfolios 

The portfolio movement provides a potential link between large-scale testing and 
classroom assessment and teaching, and could serve as an impetus for important reforms 
on all fronts, bringing together Hogan's accountants or IRS agents and the teachers whom
they audit. Mostly classroom-based and designed to provide information about student
growth, portfolios really are not much more than collections of student writing. They have
long been a staple of many informal classroom assessments marked by careful teacher 
observation and careful record keeping (e.g., anecdotal records, folders of children's work 
samples). Through such techniques, student progress is revealed by patterns in behaviors 
over time (British National Writing Project, 1987; Dixon & Stratta, 1986; Genishi & 
Dyson, 1984; Graves, 1983; Jaggar & Smith-Burke, 1985; Newkirk & Atwell, 1988;
Primary Language Record, 1988). Using folders as a basis for discussion, teachers can
easily involve students in the evaluation process (Burnham, 1986; Graves, 1983; Primary
Language Record, 1988; Simmons, 1990; Wolf, 1988), discussing with them their ways
of writing and their products, articulating changes in processes and products over time and
across kinds of writing activities; students are thus helped to formulate concepts about
"good" writing, including the variability of "good" writing across situations and audiences
(Gere & Stevens, 1985; Knoblauch & Brannon, 1984).

^2 The analytic scale may not actually give much more information than a holistic scale. Freedman (1981)
found that all the categories except usage were highly correlated. Freedman modified Diederich's scale by
combining usage with spelling and punctuation and making separate categories for sentence structure and
word choice.

Beyond the uses of portfolios in writing classrooms, they are being piloted in a 
number of other educational assessment contexts, from mathematics assessments to arts 
assessments to teacher assessments in the form of pilot tests for certifying teachers through 
the planned National Board for Professional Teaching Standards. In a discussion of the
uses of portfolios to assess teachers, Bird (1988) considers the implications of borrowing
the portfolio metaphor from other professions (e.g., art, design, photography). Bird 
argues that the educational uses of portfolios are in need of definition. For other 
professions, including professional writing, conventions define the nature and contents of a 
portfolio. In education there are no such conventions, and so according to Bird, "[T]he 
borrowed idea of 'portfolio' must be reconstructed for its new setting" (p. 4). Bird's 
concerns become particularly important if we begin to consider possible large-scale uses of 
portfolios. A survey of the literature on writing portfolios readily reveals that most 
portfolio projects lack guidance on several fundamental fronts: what writing is to be 
collected, under what conditions, for what purposes, and evaluated in what ways. Murphy
and Smith (1990) outline a set of questions that must be answered by anyone designing a
portfolio project: "Who selects what goes into the portfolio?" "What goes into the
portfolio?" "How much should be included?" "What might be done with the portfolios?"
"Who hears about the results?" "What provisions can be made for revising the portfolio 
program?" (p. 2). 

As the fundamental nature of the questions indicates, portfolio assessment is finding
its way into practice well before the concept has been defined. Wiggins (1990) explains
that people are "doing" portfolios, but the operational definitions range broadly, the
purposes vary widely, and as Bird (1988) points out, the underpinnings are metaphorical 
more than analytic and most likely "the potential of portfolio procedures depends as much 
on the political, organizational and professional settings in which they are used as on 
anything about the procedures themselves" (p. 2). Camp (1990) lists several essential 
features which contain implications for the kinds of writing and thinking activities that will 
have to accompany portfolios and that will influence the professional setting: 

multiple samples of classroom writing, preferably collected over a sustained 
period of time; 

evidence of the processes and strategies that students use in creating at least 
some of those pieces of writing;

evidence of the extent to which students are aware of the processes and 
strategies they use in writing and of their development as writers. (p. 10)

Still, the unifying theme is little more than "collecting 'real' student work," including
information about students' processes and their reflections on their work.

Before turning to the potential of portfolios to inform large-scale testing, I will first
illustrate the concept by showing how portfolios are being integrated into a school system. 
Wolf (1988, 1989a,b) writes about Arts PROPEL, a school-district portfolio project in art,
music, and imaginative writing designed as a collaborative with the Pittsburgh public
schools, Harvard's Project Zero, and the Educational Testing Service. Arts PROPEL aims
eventually to provide "alternatives to standardized assessment" (Wolf, 1989a), but first is 
exploring the power of portfolios to impact teaching and learning, to change educational 
settings: 
Central to this work [the portfolio project] are two aims. The first is to
design ways of evaluating student learning that, while providing
information to teachers and school systems, will also model [the student's]
personal responsibility in questioning and reflecting on one's own work.
The second is to find ways of capturing growth over time so that students
can become informed and thoughtful assessors of their own histories as
learners. (p. 36)

According to Wolf, teachers in Arts PROPEL are concerned with the following 
important questions underlying thoughtful pedagogy, appropriate assessment, and 
professionalized school settings: 

• How do you generate samples of work which give a genuine picture of 
what students can do? 

• How do you create "three-dimensional" records— not just of production, 
but of moments when students reflect or interact with the work of other 
writers and artists? 

• How do you invite students into the work of assessment so that they learn 
life-long lessons about appraising their own work? 

• How could the reading of portfolios turn out to be a situation in which 
teachers have the opportunity to talk with one another about what they value 
in student work? About the standards they want to set; individual 
differences in how students develop; conflicts between conventions and 
inventions? (1989b, p. 1) 

Wolf is quick to point out the importance of taking such questions seriously: 

Portfolios are not MAGIC. Just because students put their work into manila 
folders or onto tapes, there is no guarantee that the assessment that follows 
is wise or helpful. The assignments could be lockstep. Students could be 
asked to fill out worksheets on reflection. The portfolio could end up 
containing a chronological sample of short answer tests. Scoring might be 
nothing more than individual teachers counting up assignments or taking off 
points for using the wrong kind of paper. (p. 1) 

Currently, the Arts PROPEL portfolio data are not used for any assessment purpose 
beyond classroom teaching and school-level coordination of information. 

Moving Toward Large-Scale Portfolio Use: In Schools, in State Testing 
Programs, and for National Examinations in Great Britain 

How can we begin to link classroom portfolios to assessment and testing goals 
beyond the classroom? The start of an answer comes from a second example of portfolios in 
classroom use, but on a larger scale than Arts PROPEL and with some attempts at 
standardization of the information collected: The Primary Language Record (PLR), developed 
in Great Britain. The PLR is designed to introduce systematic record-keeping about 
language growth, a kind of portfolio, into all elementary classrooms in the U.K. The PLR 
was written by a committee of teachers and administrators at varied levels and piloted in 
more than 50 schools to refine the final version. The classroom teacher collects the 
portfolios for three reasons: "to inform and guide other teachers who do not yet know the 
child; to inform the headteacher and others in positions of responsibility about the child's 
work; to provide parents with information and assessment of the child's progress" (1988, 
p. 1). The British argue that all assessment should be formative and qualitative until the 
end of secondary school, and hence the PLR is designed as a qualitative assessment tool, 
but one that provides specific directions and even standard forms on which to collect and 
record children's language growth. 

For the writing portion of the record, teachers are asked to "Record observations of 
the child's development as a writer (including stories dictated by the child) across a range 
of contexts" (p. 44). Teachers are directed to consider: 

— the child's pleasure and interest in writing 

— the range and variety of her/his writing across the curriculum 

— how independent and confident the child is when writing 

— whether the child gets involved in writing and sustains that involvement 
over time 

— the child's willingness to write collaboratively and to share and discuss 
her/his writing 

— the understanding the child has of written language conventions and the 
spelling system (p. 44) 

Teachers are also asked to record observations about children's writing samples at least 
"once a term or more frequently" (p. 50).^ The writers of the PLR note that "Many schools 
already collect examples of children's writing in folders which become cumulative 
records"; the method of sampling they are suggesting "draws on that practice and allows 
for the systematic collection and analysis of work." They claim that the PLR adds "a 
structured way of looking in depth at particular pieces of writing" (p. 50). In guiding these 
structured and in-depth looks at samples of student work, the PLR asks for the inclusion 
of: "1 Context and background information about the writing . . . 2 Child's own response 
to the writing . . . 3 Teacher's response . . . 4 Development of spelling and conventions 
of writing . . . 5 What this writing shows about the child's development as a writer" (pp. 
51-52). 

An example of a six-year-old boy's writing and the sample PLR entries about it 
make clear what the record contributes: 

One day annansi met hare and they went to a tree fooU of food annansi had 
tosing a little soing to get the rope and the rope did Not come dawn its self 
his mother dropt it dawn and he climb up it hoe towld hare not to tell but at 
ferst he did not tall but in a litde wille he did. 

He towUd eliphont and the tottos and the popuqin and the caml and they 
saing the little soing and dawn came the rope and they all clambd on it and 
the rope swuing rawnd and rawnd. 

and they all screcmd and thir screcmds wock Anansi up and he shawtdid to 
his mother it is not Anansi but robbers cut the rope. 



^In the U.K. the school year is divided into three terms: fall, winter, and summer. 



and she cut the rope and anmis fell and the elphent flatnd his fas and the 
totos crct his shell and the caml brocka bon in his humpe and pocupin brock 
all hispricls. (p. 51) 

The teacher writes first about the context and background of the story: 

M. wrote this retelling after listening to the story on a story tape several 
times. Probably particularly interested in it because of the Caribbean stories 
told by storytellers who visited recently. Wrote the complete book in one 
go—took a whole morning. First draft. (p. 51) 

The child's response: 

Very pleased with it. He has talked a lot about the story since listening to 
the tape. (p. 51) 

The teacher's response: 

I was delighted. It's a very faithful retelling, revealing much detail and 
language. It's also a lengthy narrative for him to have coped with alone. 
(p. 51) 

About the student's developing control of spelling and conventions, the teacher continues: 

He has made excellent attempts at several unfamiliar words which he has 
only heard, not read, before. Apart from vowels in the middle of words he 
is getting close to standard spelling. (p. 51) 

Finally, about his general development, the teacher concludes: 

It is the longest thing he's done and the best in technical terms. He is happy 
with retelling and likes to have this support for his writing, but it would be 
nice to see him branching out with a story that is not a retelling soon. 
(p. 51) 

Basically, what the PLR provides is a guide to the teacher for commenting on students' 
work and for keeping a running record that can be accessed by others. The PLR, although 
more specific than any other writing on classroom portfolios, remains relatively vague. 
For example, the following is the only guidance for the teacher response category of the PLR: 

Is the content interesting? What about the kind of writing— is the child 
using this form confidently? And finally, how does this piece strike you as 
a reader— what is your reaction to it? (p. 52) 

The PLR also does not suggest how qualitative comments could be systematically 
aggregated to provide information about anything other than individual development. 
Certainly, the push to create classroom portfolios has great potential for improving teaching 
and learning. And the records being kept might become useful to large-scale testers, if we 
could begin to figure out some sensible ways not just to collect but also to make use of the 
data for determining how well students can write and how effective our curriculum is. 

In the U.S. we are mostly at the stage of experimenting with putting portfolio 
evaluation systems in place at the classroom and school level in sensible ways, without 
worrying too much about their wider uses. However, the hope is, as Wolf writes, that 
portfolios will someday replace more traditional forms of large-scale assessment. Toward 
this end, a number of states have begun to support portfolio development work in school 
settings, basically allowing creative teachers and administrators to "mess around" with 
portfolios, tailoring them to local contexts, seeing what happens. For example, California 
has funded several school-site efforts (see Murphy & Smith, 1990). In Alaska three 
districts are being funded to create integrated language arts portfolios: a high school in 
Fairbanks is having students put together portfolios to be judged as part of a graduation/exit 
test; a first-grade classroom in Juneau is using portfolios instead of report cards and is also 
using them to determine gains for Chapter 1 programs and for decisions about promotion to 
grade 2; and two elementary school-wide projects are being put in place in Anchorage.^ 

The state of Vermont is perhaps farther along than most others in conceptualizing a 
state-wide portfolio assessment program. The Vermont experience is showing how 
assessment goals and classroom reform can be coupled and mutually supported; however, 
for now the coupling is more like an engagement than a marriage since the plan is still only 
a plan. A draft of the plan, Vermont Writing Assessment: THE PORTFOLIO (1989), 
announces: 

We have devised a plan for a state-wide writing assessment that we think is 
humane and that reinforces sound teaching practices. . . . As a community 
of learners, we want to discover, enhance and examine good writing in 
Vermont. As we design an assessment program, we hope to combine local 
common sense with the larger world of ideas . . . and people. . . . We 
believe that guiding students as writers is the responsibility of every teacher 
and administrator in the school and that members of the public have a right 
to know the results of our efforts. (p. 1) 

Vermont plans to assess all students in grades four and eleven. The plan has three parts. 
First, students will write one piece to an assigned and timed prompt which will be 
holistically scored. Second, with the help of their classroom teacher, students will select 
and submit a "best piece" from their classroom writing portfolio. This piece will be scored 
by the same teachers who evaluate the prompted sample. Finally, state evaluation teams 
will visit all schools "to review a sample of fourth and eleventh grade portfolios" (p. 2). At 
this time the "teams will look at the range of content, the depth of revision and the student's 
willingness to take a risk" (p. 2). The idea is that "scores from the prompted sample and 
the best piece will indicate each student's writing abilities; portfolios will give a picture of 
the school's writing program" (p. 2). 

For the classroom portfolios the Vermont draft plan advises that students keep "all 
drafts of any piece the student wants included" (p. 3). The plan also advises schools to 
buy or clear storage cabinets. The idea is that students will keep this full "current-year 
folder," which will then be transferred to a permanent folder that will include a selected 
collection of the students' work from kindergarten through grade 12. The current-year 
folder will contain a cover sheet much like that just described in the PLR. It will have 
space for teacher comments, instructions and goals for the students, and the state evaluation 
team's official comments, along with a grid/checklist for documenting the process of 
producing the portfolio work. For inclusion in the portfolio, the state team will likely 
recommend a minimum set of pieces of varied types, either something expressive, 
imaginative, informative, persuasive, and formulaic (to fulfill social obligations) or, 
alternatively, a letter explaining the choices of work in the portfolio, or a piece about the 
process of composition, a piece of imaginative writing, a piece for any non-English 
curriculum area, and a personal written response to a book, current issue or the like. 



^Other states implementing or experimenting with portfolio assessment include: Alaska, Arizona, 
California, Connecticut, Maryland, New Mexico, Oregon, Texas, and Rhode Island. States that have 
expressed interest but that do not yet have formal committees include: Arkansas, Nebraska, and Utah. This 
information was compiled through 1990 telephone interviews with officials at each state department of 
education by Pamela Aschbacher of the Center for the Study of Evaluation at UCLA. 

The plan for the teachers' evaluation of the portfolio follows: "To assess student 
portfolios, we propose asking teacher-evaluators to answer a set of questions, using a 
format that allows for informal and formal portfolio reviews" (p. 13). Each question 
includes both a scale with a numerical score and a place for qualitative comments. For 
example, the first of the 14 scaled questions is: 

CHECK BOXES     GRADUATED TERMS                               HOLISTIC SCORE 
(INFORMAL)      (INFORMAL)                                    (FORMAL) 

□ 1. DOES WRITING REFLECT A SENSE OF AUTHENTIC VOICE?         2 3 4 5 6 7 8 

                □ Somewhat   □ Consistently   □ Extensively 

Other questions ask about audience awareness, logical sequence, syntax, and 
spelling as well as about the process the student used to produce the pieces and the folder, 
and about the coherence of the folder as a whole. The qualitative comment section is like 
that of the PLR, only less elaborate, with only a space for general observations and another 
for recommendations. 

The Vermont plan is comprehensive and involves provision for teacher in-service in 
the collection and evaluation of student portfolios as well as for a state-wide evaluation that 
takes into account student writing produced under both natural and testing conditions. In 
addition, through the site teams, Vermont has a plan for evaluating programs at the school 
site level. Although still in the planning stages, Vermont seems to be leading the way in 
connecting teacher in-service and assessment with the large-scale evaluation of writing 
programs and testing of writing. This coordinated plan promises to provide information 
about the development of individual students, about school programs, and about writing 
achievement in the state. 

As a final example of the large-scale use of portfolios, I want to turn to the national 
examination that determines whether or not British students at age 16+, the end of the U.S. 
tenth-grade equivalent, will graduate from secondary school and receive the equivalent of a 
U.S. high school diploma. This British examination is called the General Certificate of 
Secondary Education (GCSE).^ If students receive high scores on the GCSE, they may go 
into a two-year course, the General Certificate of Education at Advanced Level, known as 
A levels. The A level courses qualify students for entry to universities and other forms of 
higher education. Also, some employers demand A levels. Over 60% of U.K. students do 
not take A levels but instead leave school at 16+, after taking the GCSE examination. The 
GCSE serves a major gatekeeping function in Great Britain. 




^The GCSE has replaced the system by which more able students, the top 20-25%, were entered for the 
General Certificate of Education Ordinary level (O level) and others took the Certificate of Secondary 
Education (CSE). 

For the GCSE in language and literature, schools choose either a timed 
examination at the end of the two years plus a folder of coursework (portfolios) or simply a 
folder of coursework. The important point is that the GCSE now contains coursework and 
is, in large part or completely, a national, large-scale examination based on portfolios of 
students' coursework. In the case of the English and language examinations, the 
coursework is writing. The specifications for the GCSE differ slightly according to five 
different examining boards in England and Wales. For the GCSE examination schools 
have a choice of affiliating with any one of the five boards, each with a different 
examination syllabus, i.e., format and organization for the examination as well as the 
course of study. 

For the coursework-only option, students must complete 20 pieces of writing, ten 
of these for the English language examination and ten for the literature examination, the two 
examinations being separately assessed. The writing in the folder must be in a variety of 
functions, for a variety of purposes, and for different audiences (e.g., report, description, 
argument and persuasion, narrative fiction, poems, response to texts). It is assembled over a 
two-year period (usually with the same teacher for both years of the examination course), 
and the students' grades are based entirely on it. Of the ten pieces for each examination, the 
student and teacher choose the five best pieces which cover the assessment objectives for 
each examination. These are the pieces which are finally evaluated. 

For this coursework-only option, the assessment of the writing in the coursework 
folder is made by the student's teacher and by a committee of teachers in the school, and is 
then checked and standardized nationally. The national standard-setting for portfolio marking 
is done somewhat differently by the different examining boards, but the general plans are 
quite similar. A booklet produced by the NEA reports that representatives from each 
school, who are teachers involved in the national standard setting, meet twice a year 
for trial marking sessions where they receive photocopies of scripts or portfolios entered 
by four students the previous year. The portfolios do not have grades, so the teachers 
decide the grade they would give if the candidate were their student. The teachers submit 
their grades at a school meeting where the portfolios are discussed and a school grade is 
agreed upon. Representatives from each school attend a consortium trial marking meeting 
where portfolios and grades are discussed again. A member of the NEA's National 
Review Board attends this meeting and explains the grades the Board has given. After this 
training period a committee of teachers in the school agrees on grades for the coursework 
folders from that school (at least two teachers from the committee have to agree on the 
grade), and then the folders are sent to a review panel where the reviewers evaluate a 
sample from each school. If the National Board consistently disagrees with the evaluations 
from a school, all portfolios from that school are regraded. The final grade for the student 
is then sent back to the school. 

The important point is that the student's examination grades for language and then 
for literature are based on an evaluation of the set of pieces in that area in the folder. The 
portfolio evaluation consists of a grade given for a group of pieces and is not derived from 
an average of grades on individual pieces. All assessors, including the National Review 
Panel, are practicing teachers. 

The GCSE is elaborate and standardized, both in the plan for marking the folders 
and in the plan for collecting the work that goes in them. The GCSE also shows the crucial 
role the teacher plays in the student's success on a portfolio evaluation. Teachers always 
play this role, of course, but portfolios place the responsibility unequivocally and directly 
in the teacher's lap. 

The PLR, the Vermont plan, and the GCSE illustrate several ways that portfolio 
assessment can be used, with the assessment designs appropriately varied according to the 
functions they fulfill. Although the models of well-conceived, large-scale portfolio 
programs are few, they are certainly beginning to emerge, and they are marked by their 
thoughtful approach to students and to the evaluation of their work. 

Conclusions 

In the assessment of writing, the concept of the portfolio seems particularly 
appealing because writers, like artists, can collect representative samples of their work that 
provide a sense of the range and quality of what they can do (Anson, Bridwell-Bowles, & 
Brown, 1988; Burnham, 1986; Camp, 1985a,b, 1990; Elbow & Belanoff, 1986a,b; 
Fowles & Gentile, 1989; Lucas, 1988a,b; Murphy & Smith, 1990; Simmons, 1990; 
Stiggins, 1988; Wolf, 1988, 1989a). Portfolios can be collected as part of an ongoing 
instructional program and get around the problem of one-shot evaluation procedures 
(Anson et al., 1988; Belanoff, 1985; Burnham, 1986; Calfee & Sutter-Baldwin, 1987; 
Calfee & Hiebert, 1988; Camp, 1985a,b; Camp & Belanoff, 1987; Elbow, 1986; Elbow & 
Belanoff, 1986a,b; Fowles & Gentile, 1989; Lucas, 1988a,b; Murphy & Smith, 1990; 
Simmons, 1990; Valencia, McGinley, & Pearson, 1990; Wolf, 1988). Providing direction 
for large-scale portfolio efforts that could inform and be informed by classroom efforts is 
particularly important, since testing programs often exert powerful influences over the 
nature of instruction in writing and reflect "what counts" as literacy (Calfee & Hiebert, 
1988; Cooper, 1981a; Cooper & Murphy, in progress; Cooper & Odell, 1977; Diederich, 
1974; Loofbourrow, 1990; Mellon, 1975; Myers, 1980; Resnick & Resnick, 1977, 1990). 
There is an important role for teacher-driven and classroom-based assessment in our plans 
for educational reform. 

But I want to end with a word of warning. Currently, in the U.S. the National 
Assessment is experimenting with the collection of information from writing portfolios. 
Preliminary results are showing that when a random group of teachers is simply asked to 
submit student work, called portfolios, without the accompanying staff development and 
professional activities outlined in most of the programs I have described, the writing that 
they submit is rather dismal. As the careful work of the Pittsburgh Arts PROPEL project 
shows, just collecting and evaluating portfolios will solve neither our assessment 
problems nor our need to create a professional climate in our schools. By coupling 
assessment and instruction in increasingly sophisticated ways, we may be able to make a 
real difference in education in this country. What I have offered here is an overview of 
writing assessment and some examples of programs that might stimulate us to think about 
new directions. 